Abstract

In this paper, we develop optimal policies for deciding when a wireless node with radio frequency 
(RF) energy harvesting (EH) capabilities should try and harvest ambient RF energy. While the idea of 
RF-EH is appealing, it is not always beneficial to attempt to harvest energy; in environments where the 
ambient energy is low, nodes could consume more energy being awake with their harvesting circuits 
turned on than what they can extract from the ambient radio signals; it is then better to enter a sleep 
mode until the ambient RF energy increases. Towards this end, we consider a scenario with intermittent 
energy arrivals and a wireless node that wakes up for a period of time (herein called the time-slot) and 
harvests energy. If enough energy is harvested during the time-slot, then the harvesting is successful and
excess energy is stored; however, if not enough energy is available, the harvesting is unsuccessful
and energy is lost.

We assume that the ambient energy level is constant during the time-slot, and changes at slot 
boundaries. The energy level dynamics are described by a two-state Gilbert-Elliott Markov chain model, 
where the state of the Markov chain can only be observed during the harvesting action, and not when in 
sleep mode. Two scenarios are studied under this model. In the first scenario, we assume that we have 
knowledge of the transition probabilities of the Markov chain and formulate the problem as a Partially 
Observable Markov Decision Process (POMDP), where we find a threshold-based optimal policy. In the 
second scenario, we assume that we have no knowledge of these parameters and formulate
the problem as a Bayesian adaptive POMDP; to reduce the complexity of the computations we also
propose a heuristic posterior sampling algorithm. The performance of our approaches is demonstrated 
via numerical examples. 
Fig. 1. In radio frequency energy harvesting, a device that is not the destination of a packet can capture the RF radiation of
wireless transmissions from cellular communication, WiFi or TV towers, and convert it into a direct current through rectennas.


Index Terms 


Energy harvesting, ambient radio frequency energy, Partially Observable Markov Decision Process,
Bayesian inference.


I. Introduction 

In green communications and networking, renewable energy sources can replenish the energy 
of network nodes and be used as an alternative power source without additional cost. Radio 
frequency (RF) energy harvesting (EH) is one of the energy harvesting methods that have recently 
attracted a lot of attention (see, for example, [1]-[3] and references therein). In RF-EH, a device
can capture ambient RF radiation from a variety of radio transmitters (such as television/radio
broadcast stations, WiFi, cellular base stations and mobile phones), and convert it into a direct
current through rectennas [4]; see Figure 1. It has been shown that low-power wireless systems
such as wireless sensor networks with RF energy harvesting capabilities can have a significantly
prolonged lifetime, even to the point where they can become self-sustained and support previously
infeasible ubiquitous communication applications [5].

However, in many cases the RF energy is intermittent. This can be due to temporary inactive
periods of communication systems with bursty traffic and/or multi-path fading in wireless
channels [6]. Moreover, the energy spent by wireless devices to wake up the radio and assess
the channel is non-negligible. Hence, when the ambient energy is low, it is energy-inefficient
for a node to try to harvest energy, and it is better to sleep. The challenge in the energy harvesting
process lies in the fact that the wireless device does not know the energy level before trying to
harvest. For this reason, it is crucial to develop policies that determine when a wireless node
should harvest or sleep in order to maximize the accumulated energy.

In this paper, we study the problem of energy harvesting for a single wireless device in an 
environment where the ambient RF energy is intermittent. Energy harvesting with intermittent 
energy arrivals has been recently investigated under the scenario that the energy arrivals are 
described by known Markov processes [7]-[11]. However, the energy arrivals may not follow
the chosen Markov process model. It is therefore necessary not to presume the arrival model,
but to allow for an unknown energy arrival model. In this direction, the problem has only
been targeted via the classical Q-learning method in [12]. The Robbins-Monro algorithm, the
mathematical cornerstone of Q-learning, was applied in [13] to derive optimal policies with a
faster convergence speed by exploiting the fact that the optimal policy is threshold-based. However,
both the Q-learning method and the Robbins-Monro algorithm rely on heuristics (e.g., ε-greedy) to
handle the exploration-exploitation trade-off [14]. The optimal choice of the step-size for the best
convergence speed is also not clear; only a set of sufficient conditions for asymptotic convergence
is given.

All the aforementioned works assume that the energy arrival state is known at the decision 
maker, before the decision is taken. This is an unrealistic assumption since it does not take 
into account the energy cost for the node to wake up and track the energy arrival state, while 
being active continuously can be detrimental in cases of low ambient energy levels. The partial 
observability issues in energy harvesting problems have only been considered in scenarios such 
as the knowledge of the State-of-Charge [15], the event occurrence in the optimal sensing
problem [16], and the channel state information for packet transmissions [17]. To the best of our
knowledge, neither the scenario with partial observability of the energy arrival nor this scenario 
coupled with an unknown model have been addressed in the literature before. 

Due to the limited energy arrival knowledge and the cost for unsuccessful harvesting, the 
fundamental question being raised is whether and when it is beneficial for a wireless device to 
try and harvest energy from ambient energy sources. In this paper, we aim at answering this 
question by developing optimal sleeping and harvesting policies that maximize the accumulated 
energy. More specifically, the contributions of this paper are summarized as follows. 

• We model the energy arrivals using an abstract two-state Markov chain model where the
node receives a reward at the good state and incurs a cost at the bad state. The state of the
model is revealed to the node only if it chooses to harvest. In the absence of new observations,
future energy states are predicted based on knowledge about the transition probabilities of the
Markov chain.

• We propose a simple yet practical reward function that encompasses the effects of the decisions
made based on the states of the Markov chain.

• We study the optimal energy harvesting problem under two assumptions on the parameters of
the energy arrival model.

1) For the scenario where the parameters are known, we formulate the problem of whether 
to harvest or to sleep as a Partially Observable Markov Decision Process (POMDP). We 
show that the optimal policy has a threshold structure: after an unsuccessful harvesting, 
the optimal action is to sleep for a constant number of time slots that depends on the 
parameters of the Markov chain; otherwise, it is always optimal to harvest. The threshold 
structure leads to an efficient computation of the optimal policy. Only a handful of papers
have explicitly characterized the optimality of threshold-based policies for POMDPs (for
example, [18], [19]) and they do not deal with the problem considered in this work.

2) For the scenario when the transition probabilities of the Markov chain are not known, we 
apply a novel Bayesian online-learning method. To reduce the complexity of the computations,
we propose a heuristic posterior sampling algorithm. The main idea of Bayesian online
learning is to specify a prior distribution over the unknown model parameters, and update 
a posterior distribution by Bayesian inference over these parameters to incorporate new 
information about the model as we choose actions and observe results. The exploration- 
exploitation dilemma is handled directly as an explicit decision problem modeled by an 
extended POMDP, where we aim to maximize future expected utility with respect to the 
current uncertainty on the model. The other advantage is that we can define an informative 
prior to incorporate previous beliefs about the parameters, which can be obtained from, 
for example, domain knowledge and field tests. Our work is the first in the literature that
introduces and applies the Bayesian adaptive POMDP framework [20] in energy harvesting
problems with unknown state transition probabilities.

The schemes proposed in this paper are evaluated in simulations, and significant improvements
are demonstrated compared to having the wireless node harvest all the time or attempt to harvest
at random.

The rest of this paper is organized as follows. The system model and the energy harvesting
problem are introduced in Section II. In Section III we address the case of known Markov
chain parameters, and using a POMDP we derive optimal sleeping and harvesting policies; the
threshold-based structure of the optimal policies is also shown. In Section IV we address the
case of unknown Markov chain parameters and we propose the Bayesian online learning method.
Numerical examples are provided in Section V. Finally, in Section VI we draw conclusions and
outline possible future research directions.


II. System Model 

We consider a single wireless device with the capability of harvesting energy from ambient
energy sources. We assume that the overall energy level is constant during one time-slot, and may
change in the next time-slot according to a two-state Gilbert-Elliott Markov chain model [21],
[22]; see Fig. 2. In this model, the good state (G) denotes the presence of energy to be harvested
and the bad state (B) denotes the absence of energy to be harvested. The transition probability
from the G state to the B state is $p$, and the transition probability from the B state to the G state
is $q$. The probabilities of staying at states G and B are $1 - p$ and $1 - q$, respectively. It can be
easily shown that the steady state probabilities of the Markov chain at the B and G states are
$p/(p + q)$ and $q/(p + q)$, respectively.

Fig. 2. Two-state Gilbert-Elliott Markov chain model.
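As a quick numerical check of the steady-state expressions above, the following Python sketch verifies that the stated distribution is unchanged by one transition of the chain. The function names and the example values p = 0.2, q = 0.1 are illustrative assumptions and do not come from the paper.

```python
def stationary_distribution(p, q):
    """Return (P[G], P[B]) for a chain with P(G->B) = p and P(B->G) = q."""
    if p + q <= 0.0:
        raise ValueError("p + q must be positive")
    return q / (p + q), p / (p + q)


def one_step(prob_g, prob_b, p, q):
    """Push a distribution over (G, B) through one slot boundary."""
    next_g = (1.0 - p) * prob_g + q * prob_b
    next_b = p * prob_g + (1.0 - q) * prob_b
    return next_g, next_b


if __name__ == "__main__":
    # Illustrative parameter values only (assumption, not from the paper).
    pi_g, pi_b = stationary_distribution(0.2, 0.1)
    print("stationary:", pi_g, pi_b)
    print("after one step:", one_step(pi_g, pi_b, 0.2, 0.1))
```

If the distribution is stationary, the two printed pairs agree up to floating-point rounding.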

At each time-slot, the node has two action choices: harvesting or sleeping. If the node chooses
to harvest and the Markov chain is in the G state, a reward $r_1 > 0$ is received that represents
the energy successfully harvested. If the Markov chain is in the B state during the harvesting
action, a penalty $-r_0 < 0$ is incurred that represents the energy cost required to wake up the
radio and try to detect whether there is any ambient energy to harvest. On the other hand, if the
node sleeps, no reward is received. Therefore, the reward function is defined as

$$R(s, a) = \begin{cases} r_1, & a = H \wedge s = G, \\ -r_0, & a = H \wedge s = B, \\ 0, & a = S, \end{cases} \tag{1}$$

where $a$ denotes the harvesting action ($H$) or the sleeping action ($S$), and $s$ is the current state
of the Markov chain.


Remark 1. Note that one could impose a cost for sleeping. However, this does not change the 
problem setup since we could normalize the rewards and costs so that the sleeping cost is zero. 

Remark 2. In addition, the choice of the exact numbers for $r_0$ and $r_1$ depends on hardware
specifications, such as the energy harvesting efficiency and the energy harvesting cost. Even
though in reality the energy harvested, and hence the reward $r_1$, is not fixed, $r_1$
can be seen as the minimum or average energy harvested during a time-slot. Similarly, $r_0$ can
be seen as the maximum or average energy spent during a slot when the node failed to harvest
energy.

The state information of the underlying Markov chain can only be observed through the harvesting
action, but there is a cost associated with an unsuccessful energy harvesting. On the other
hand, the sleeping action neither reveals the state information nor incurs any cost. Thus, it is not
immediately clear when it is better to harvest to maximize the reward. Furthermore, the transition
probabilities of the Markov chain may not be known a priori, which makes the problem of 
maximizing the reward even more challenging. 

Let $a_t \in \{H, S\}$ denote the action at time $t$, $s_t$ denote the state of the Markov chain at time
$t$, and $z_t \in \{G, B, Z\}$ denote the observation at time $t$, where $Z$ means no observation of the
Markov chain. Let $a^t = \{a_0, a_1, \ldots, a_t\}$ denote the history of actions and $z^t = \{z_0, z_1, \ldots, z_t\}$
denote the history of observations. A policy $\pi$ is a function that randomly prescribes an action
at time $t$ based on the history of actions and observations up to time $t - 1$. The goal is then to
find the optimal policy $\pi^*$ that maximizes the expected total discounted reward,

$$\pi^* \in \arg\max_{\pi} \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t R_t(s_t, a_t)\Big],$$

where $R_t$ is the reward at time $t$ and the expectation is taken with respect to the randomization
in the policy and the transitions of the Markov chain. The discount factor $\gamma \in [0, 1)$ models the
importance of the energy arrivals at different time slots: energy harvested in the
future is discounted. The discount factor can also be interpreted as a scenario where the node
terminates its operation at each time-slot independently with probability $(1 - \gamma)$ [23].


III. Optimal structured policy with unknown Markovian states 

In this section, we first solve the problem of deriving the optimal policy with known transition
probabilities and unknown Markovian states by formulating it as a Partially Observable Markov
Decision Process (POMDP) [24]. We further show that the optimal policy has a threshold-based
structure. This structural result simplifies both the off-line computations during the design phase
and the real-time implementation.


A. POMDP formulation 

Although the exact state is not known at each time-slot, we can keep a probability distribution 
(i.e., belief) of the state based on the past observations and the knowledge on the Markov chain. 
It has been shown that such a belief is a sufficient statistic [24], and we can convert the POMDP 
to a corresponding MDP with the belief as the state. 

Let the scalar $b$ denote the belief that the state is good (i.e., G) at the current time-slot. If the
action is to harvest at the current time-slot, in the next time-slot the belief can be either $b_B = q$
or $b_G = 1 - p$ depending on the harvesting result. If the action is to sleep, the belief is updated
according to the Markov chain, i.e.,

$$b' = (1 - p)b + q(1 - b) = q + (1 - p - q)b, \tag{2}$$

which is the probability of being in the good state at the next time-slot given the probability at
the current time-slot. This update converges to the stationary probability of the good state. In
summary, we have the following state transition probability

$$P(b' \mid a, b) = \begin{cases} b & \text{if } a = H,\ b' = b_G, \\ 1 - b & \text{if } a = H,\ b' = b_B, \\ 1 & \text{if } a = S,\ b' = q + (1 - p - q)b, \\ 0 & \text{otherwise.} \end{cases}$$


We let $1 - p > q$, which has the physical meaning that the probability of being in the G state is
higher if the previous state was G rather than B. Hence, the belief $b$ takes discrete values between
$q$ and $1 - p$, and the number of beliefs is countably infinite.

By Equation (1), the expected reward with belief $b$ is

$$R(b, a) = bR(G, a) + (1 - b)R(B, a) = \begin{cases} (r_0 + r_1)b - r_0, & a = H, \\ 0, & a = S. \end{cases}$$
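The belief transitions and the expected reward above can be sketched in a few lines of Python. This is only an illustration of the belief-MDP ingredients under the stated model; the function names and the string labels "H" and "S" for the two actions are assumptions made for the example.

```python
def expected_reward(b, action, r0, r1):
    """Expected one-slot reward at belief b (probability of the good state)."""
    if action == "H":
        return (r0 + r1) * b - r0
    if action == "S":
        return 0.0
    raise ValueError("action must be 'H' (harvest) or 'S' (sleep)")


def belief_transitions(b, action, p, q):
    """Return a list of (probability, next_belief) pairs for the belief MDP."""
    if action == "H":
        # Harvesting reveals the state: good w.p. b, bad w.p. 1 - b.
        # The next belief is then 1 - p (after good) or q (after bad).
        return [(b, 1.0 - p), (1.0 - b, q)]
    if action == "S":
        # Sleeping reveals nothing; the belief follows the chain, Equation (2).
        return [(1.0, q + (1.0 - p - q) * b)]
    raise ValueError("action must be 'H' (harvest) or 'S' (sleep)")
```

For example, with p = 0.2 and q = 0.1, sleeping at belief 0.5 moves the belief to 0.1 + 0.7 * 0.5 = 0.45.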


Any combination of the action history $a^t$ and the observation history $z^t$ corresponds to a
unique belief $b$. Hence, the policy $\pi$ is also a function that prescribes a random action $a$ for the
belief $b$. The expected total discounted reward for a policy $\pi$ starting from initial belief $b_0$, also
termed the value function, is then

$$V^{\pi}(b_0) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t R(b_t, a_t)\Big].$$

Since the state space is countable and the action space is finite with only two actions, there
exists an optimal deterministic stationary policy $\pi^*$ for any $b$ [23, Theorem 6.2.10] such that

$$\pi^* \in \arg\max_{\pi} V^{\pi}(b).$$
B. Optimal policy - value iteration

Let $V^* = V^{\pi^*}$ be the optimal value function. The optimal policy can be derived from the
optimal value function, i.e., for any $b$, we have

$$\pi^*(b) \in \arg\max_{a \in \{H, S\}} \Big[R(b, a) + \gamma \sum_{b'} P(b' \mid a, b) V^*(b')\Big].$$

The problem of deriving the optimal policy is then to compute the optimal value function. It
is known that the optimal value function satisfies the Bellman optimality equation [23, Theorem
6.2.5],

$$V^*(b) = \max_{a \in \{H, S\}} \Big[R(b, a) + \gamma \sum_{b'} P(b' \mid a, b) V^*(b')\Big],$$

and the optimal value function can be found by the value iteration method shown in Algorithm 1.
The algorithm utilizes the fixed-point iteration method to solve the Bellman optimality equation
with a stopping criterion. If we let $t \to \infty$, then the algorithm returns the optimal value function
$V^*(b)$ [23].


Algorithm 1: Value iteration algorithm

Input: Error bound $\epsilon$
Output: $V(b)$ with $\sup_b |V(b) - V^*(b)| < \epsilon/2$.

Initialization: At $t = 0$, let $V_0(b) = 0$ for all $b$.
repeat
    Compute $V_{t+1}(b)$ for all states $b$:
        $V_{t+1}(b) = \max_{a \in \{H, S\}} \big[R(b, a) + \gamma \sum_{b'} P(b' \mid a, b) V_t(b')\big]$.
    Update $t = t + 1$.
until $\sup_b |V_{t+1}(b) - V_t(b)| < \epsilon(1 - \gamma)/(2\gamma)$.
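A minimal Python sketch of Algorithm 1 is given below. Because the belief space is countably infinite, the sketch truncates it: it keeps only the beliefs reached from q and from 1 - p by at most `depth` consecutive sleeping slots, and lets the last belief of each chain map to itself under sleeping. This truncation, the parameter `depth`, and all function and variable names are assumptions of the example and are not part of the paper's algorithm.

```python
def value_iteration(p, q, r0, r1, gamma, eps=1e-6, depth=200):
    """Value iteration on a truncated belief grid; returns (beliefs, values, policy)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("this sketch assumes 0 < gamma < 1")

    def chain(start):
        # Beliefs obtained from `start` by 0, 1, ..., depth sleeping slots.
        out = [start]
        for _ in range(depth):
            out.append(q + (1.0 - p - q) * out[-1])
        return out

    bad_chain = chain(q)          # after a bad observation
    good_chain = chain(1.0 - p)   # after a good observation
    beliefs = bad_chain + good_chain
    idx_bad, idx_good = 0, len(bad_chain)
    chain_ends = {len(bad_chain) - 1, len(beliefs) - 1}

    values = [0.0] * len(beliefs)                  # V_0(b) = 0 for all b
    tolerance = eps * (1.0 - gamma) / (2.0 * gamma)
    while True:
        new_values, policy = [], []
        for i, b in enumerate(beliefs):
            # Harvest: reward plus discounted value of belief 1 - p or q.
            v_h = (r0 + r1) * b - r0 + gamma * (
                b * values[idx_good] + (1.0 - b) * values[idx_bad])
            # Sleep: move one step along the chain (truncated at the end).
            nxt = i if i in chain_ends else i + 1
            v_s = gamma * values[nxt]
            new_values.append(max(v_h, v_s))
            policy.append("H" if v_h >= v_s else "S")
        gap = max(abs(x - y) for x, y in zip(new_values, values))
        values = new_values
        if gap < tolerance:
            return beliefs, values, policy
```

The returned `policy` list can be inspected along the chain that starts at q to read off how many slots the node sleeps after an unsuccessful harvest.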


C. Optimality of the threshold-based policy 

Let $V_{t+1}(b, a)$ denote the value function of any action $a \in \{H, S\}$ in Algorithm 1, and let
$V_\infty(b, a) = \lim_{t \to \infty} V_t(b, a)$. We first show that the optimal policy has a threshold structure.

Proposition 1. Define

$$\hat{b} = \min_b \{V_\infty(b, H) \ge V_\infty(b, S)\}.$$

If the threshold $\hat{b} \ge q/(p + q)$, then the optimal policy is to never harvest. If $\hat{b} < q/(p + q)$, then
the optimal policy is to continue to harvest after a successful harvesting time slot, and to sleep
for

$$N = \left\lceil \log_{1 - p - q} \frac{q - (p + q)\hat{b}}{q} \right\rceil - 1$$

time slots after an unsuccessful harvesting.


Proof: The proof relies on two lemmas presented at the end of this section. We first prove
that the optimal action is to harvest for any belief $b \ge \hat{b}$ and to sleep for any belief $b < \hat{b}$. From
the definition of $\hat{b}$, it is clear that it is always optimal to sleep for belief $b < \hat{b}$. From Equation (6)
and Equation (7), we have that

$$V_\infty(b, H) = \alpha_{h,\infty} + \beta_{h,\infty} b,$$

$$V_\infty(b, S) = \max_{\{\alpha_s, \beta_s\} \in \Gamma_{s,\infty}} \{\alpha_s + \beta_s b\},$$

where $\Gamma_{s,\infty} = \{\gamma(\alpha + \beta q), \gamma\beta(1 - p - q) : \forall \{\alpha, \beta\} \in \Gamma_\infty\}$, and $\Gamma_\infty = \Gamma_{s,\infty} \cup \{\alpha_{h,\infty}, \beta_{h,\infty}\}$. Let
$B_{s,\infty} = \{\beta : \{\alpha, \beta\} \in \Gamma_{s,\infty}\}$ and $B_\infty = B_{s,\infty} \cup \{\beta_{h,\infty}\}$. Hence, every $\beta$ value in $B_{s,\infty}$ is generated
by a scaling factor $\gamma(1 - p - q)$ from the set $B_\infty$. Since $\gamma(1 - p - q)$ is strictly smaller than one
and $\beta \ge 0$ from Lemma 2, we have that $\beta_{h,\infty} \ge \max\{\beta_s\}$ by contradiction. Since
$V_\infty(\hat{b}, H) \ge V_\infty(\hat{b}, S)$, it follows that $V_\infty(b, H) \ge V_\infty(b, S)$ for any $b \ge \hat{b}$.

Observe that after an unsuccessful harvesting and sleeping additionally for $t - 1$ time slots,
the belief $b$ is

$$b = q \sum_{i=0}^{t-1} (1 - p - q)^i = q\,\frac{1 - (1 - p - q)^t}{p + q}.$$

Since $1 - p - q \in (0, 1)$, this is monotonically increasing with $t$ and converges to $q/(p + q)$. The
proposition follows by deriving $t$ such that the belief is larger than the threshold $\hat{b}$. ■
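The last step of the proof can be illustrated numerically. The sketch below computes the number of sleeping slots implied by a given threshold, assuming that the rounding in Proposition 1 is a ceiling and that "larger than the threshold" means at least the threshold; both are assumptions of this example, as are the function names. Exact boundary cases may be affected by floating-point rounding.

```python
import math


def belief_after_sleep(p, q, n):
    """Belief after a bad observation followed by n sleeping slots."""
    return q * (1.0 - (1.0 - p - q) ** (n + 1)) / (p + q)


def sleep_slots_from_threshold(p, q, b_hat):
    """Sleeping slots after an unsuccessful harvest; None means never harvest."""
    if not 0.0 < 1.0 - p - q < 1.0:
        raise ValueError("this sketch assumes 0 < 1 - p - q < 1")
    if b_hat >= q / (p + q):
        return None  # the belief never reaches the threshold while sleeping
    ratio = (q - (p + q) * b_hat) / q
    n = math.ceil(math.log(ratio) / math.log(1.0 - p - q)) - 1
    return max(n, 0)
```

For instance, with p = 0.2, q = 0.1 and a threshold of 0.25, the node sleeps for three slots: the belief is about 0.219 after two sleeping slots and about 0.253 after three.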

Proposition 1 suggests that we can focus on the set of policies with threshold structure, which
is a much smaller set than the set of all policies. This leads to an efficient computation of the
optimal policy, shown in Proposition 2.


Proposition 2. Let $b' = q[1 - (1 - p - q)^{n+1}]/(p + q)$, let $F(n) = \gamma^{n+1} r_1 (b' - 1 + p) + r_1 - p(r_0 + r_1)$,
and let $G(n) = \gamma^{n+1}\big(b'(1 - \gamma) - (1 - \gamma + \gamma p)\big) + 1 - \gamma + \gamma p$. The optimal policy is to continuously
harvest after a successful harvesting, and to sleep for

$$N^* = \arg\max_{n \in \{0, 1, \ldots\}} \frac{F(n)}{G(n)}$$

time slots after an unsuccessful harvesting.


Proof: Let $\pi_n$ denote the policy that sleeps $n$ time slots after a bad state observation, and
always harvests after a good state observation. By Proposition 1, the optimal policy is a type of
$\pi_n$ policy, and we need to find the optimal sleeping time that gives the maximum reward.

Recall that the belief after a good state observation is $1 - p$, and after a bad state observation is
$q$. The belief after a bad state observation and sleeping $n$ time slots is

$$b' = q \sum_{i=0}^{n} (1 - p - q)^i = q\,\frac{1 - (1 - p - q)^{n+1}}{p + q}.$$

At belief $q$, the $\pi_n$ policy is to sleep for $n$ time slots, and thus

$$V^{\pi_n}(q) = \gamma^n V^{\pi_n}(b'). \tag{3}$$

At belief $1 - p$, the $\pi_n$ policy is to harvest, and thus

$$V^{\pi_n}(1 - p) = (1 - p)(r_0 + r_1) - r_0 + \gamma p V^{\pi_n}(q) + \gamma(1 - p) V^{\pi_n}(1 - p). \tag{4}$$

At belief $b'$, the $\pi_n$ policy is also to harvest, and thus

$$V^{\pi_n}(b') = b'(r_0 + r_1) - r_0 + \gamma^{n+1}(1 - b') V^{\pi_n}(b') + \gamma b' V^{\pi_n}(1 - p). \tag{5}$$

By solving Equations (3)-(5), $V^{\pi_n}(1 - p)$ corresponds to $F(n)/G(n)$. Hence, $N^*$
is the optimal sleeping time that gives the maximum reward within the set of policies defined
by $\pi_n$. Since the optimal policy has this structure, the proposition is then proved. ■
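The search in Proposition 2 can be sketched as a direct evaluation of F(n)/G(n) over a finite range of sleeping times. The cap `n_max` on the search range and the function names are assumptions of this example; the paper's maximization is over all non-negative integers.

```python
def policy_value_after_success(n, p, q, r0, r1, gamma):
    """F(n)/G(n): value at belief 1 - p of the policy sleeping n slots after a failure."""
    b_next = q * (1.0 - (1.0 - p - q) ** (n + 1)) / (p + q)
    g_pow = gamma ** (n + 1)
    f_n = g_pow * r1 * (b_next - 1.0 + p) + r1 - p * (r0 + r1)
    g_n = (g_pow * (b_next * (1.0 - gamma) - (1.0 - gamma + gamma * p))
           + 1.0 - gamma + gamma * p)
    return f_n / g_n


def optimal_sleep_slots(p, q, r0, r1, gamma, n_max=1000):
    """Sleeping time in {0, ..., n_max} that maximizes F(n)/G(n)."""
    return max(range(n_max + 1),
               key=lambda n: policy_value_after_success(n, p, q, r0, r1, gamma))
```

Since each evaluation takes constant time, the whole search is linear in `n_max`, which is what makes the threshold structure attractive compared with running value iteration over the belief space.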


Lemma 1. The value function $V_t(b)$ in the value iteration algorithm at any time $t$ is a piecewise
linear convex function over belief $b$, i.e.,

$$V_t(b) = \max_{\{\alpha, \beta\} \in \Gamma_t \subset \mathbb{R}^2} \{\alpha + \beta b\},$$

where the set $\Gamma_t$ is computed iteratively from the set $\Gamma_{t-1}$ with the initial condition $\Gamma_0 = \{0, 0\}$.

Proof: We prove the lemma by induction on time $t$. The statement is correct when $t = 0$
with $\Gamma_0 = \{0, 0\}$ since $V_0(b) = 0$ for all $b$. Suppose the statement is correct for any $t$. The value
function of the sleeping action at time $t + 1$ is

$$V_{t+1}(b, S) = \gamma V_t(q + b(1 - p - q))$$
$$= \gamma \max_{\{\alpha, \beta\} \in \Gamma_t} \{\alpha + \beta(q + b(1 - p - q))\}$$
$$= \max_{\{\alpha, \beta\} \in \Gamma_t} \{\gamma(\alpha + \beta q) + b\gamma\beta(1 - p - q)\}.$$

Define

$$\Gamma_{s,t+1} = \{\gamma(\alpha + \beta q), \gamma\beta(1 - p - q) : \forall \{\alpha, \beta\} \in \Gamma_t\},$$
$$\alpha_s = \gamma(\alpha + \beta q),$$
$$\beta_s = \gamma\beta(1 - p - q).$$

Hence, we have

$$V_{t+1}(b, S) = \max_{\{\alpha_s, \beta_s\} \in \Gamma_{s,t+1}} \{\alpha_s + \beta_s b\}. \tag{6}$$

The value function of the harvesting action is

$$V_{t+1}(b, H) = (r_0 + r_1)b - r_0 + \gamma V_t(b_B)(1 - b) + \gamma V_t(b_G)b$$
$$= -r_0 + \gamma V_t(b_B) + \big(r_0 + r_1 + \gamma(V_t(b_G) - V_t(b_B))\big)b.$$

Define

$$\alpha_{h,t} = -r_0 + \gamma V_t(b_B),$$
$$\beta_{h,t} = r_0 + r_1 + \gamma(V_t(b_G) - V_t(b_B)).$$

We then have

$$V_{t+1}(b, H) = \alpha_{h,t} + \beta_{h,t} b. \tag{7}$$

Since $V_{t+1}(b) = \max\{V_{t+1}(b, S), V_{t+1}(b, H)\}$, the statement is proved by defining
$\Gamma_{t+1} = \{\alpha_{h,t}, \beta_{h,t}\} \cup \Gamma_{s,t+1}$. ■

Lemma 2. For any $t$, if $b_1 \ge b_2$, then $V_t(b_1) \ge V_t(b_2)$. For any $\{\alpha, \beta\} \in \Gamma_t$, we have $\beta \ge 0$.

Proof: We prove the lemma by induction on time $t$. Since $V_0(b) = 0$ for all $b$ at time
$t = 0$ and $\Gamma_0 = \{0, 0\}$, the statement is correct at time $t = 0$. Suppose the statement is correct
at time $t$. Since $1 - p - q > 0$ and $\beta \ge 0$, we have that

$$\gamma(\alpha + \beta q) + b_1\gamma\beta(1 - p - q) \ge \gamma(\alpha + \beta q) + b_2\gamma\beta(1 - p - q).$$

By Equation (6), we have $V_{t+1}(b_1, S) \ge V_{t+1}(b_2, S)$. Since $b_G \ge b_B$, we also have $V_t(b_G) \ge
V_t(b_B)$ by the induction condition. By Equation (7), we have $V_{t+1}(b_1, H) \ge V_{t+1}(b_2, H)$. Hence,
we have that $V_{t+1}(b_1) \ge V_{t+1}(b_2)$. Similarly, we can also derive that $\beta \ge 0$ for any $\{\alpha, \beta\} \in \Gamma_{t+1}$.


IV. Bayesian online learning unknown transition probabilities 

In many practical scenarios, the transition probabilities of the Markov chain that model the 
energy arrivals may be initially unknown. To obtain an accurate estimation, we need to sample 
the channel many times, a process which unfortunately consumes a large amount of energy and 
takes a lot of time. Thus, it becomes crucial to design algorithms that balance the parameter 
estimation and the overall harvested energy; this is the so-called exploration and exploitation 
dilemma. Towards this end, in this section, we first formulate the optimal energy harvesting
problem with unknown transition probabilities as a Bayesian adaptive POMDP [20]. Next, we
propose a heuristic posterior sampling algorithm based on the threshold structure of the optimal 
policy with known transition probabilities. The Bayesian approach can incorporate the domain 
knowledge by specifying a proper prior distribution of the unknown parameters. It can also strike 
a natural trade-off between exploration and exploitation during the learning phase. 
A. Models and Bayesian update

The Beta distribution is a family of distributions that is defined on the interval [0, 1] and
parameterized by two parameters. It is typically used as a conjugate prior distribution for Bernoulli
distributions so that the posterior update after observing state transitions is easy to compute. Hence,
for this work, we assume that the unknown transition probabilities $p$ and $q$ have independent
prior distributions following the Beta distribution parameterized by $\phi = [\phi_1\ \phi_2\ \phi_3\ \phi_4]^T \in \mathbb{Z}_+^4$,
i.e.,

$$P(p, q; \phi) = P(p, q; \phi_1, \phi_2, \phi_3, \phi_4) \overset{(a)}{=} P(p; \phi_1, \phi_2) P(q; \phi_3, \phi_4),$$

where (a) stems from the fact that $p$ and $q$ have independent prior distributions. The Beta
densities of probabilities $p$ and $q$ are given by

$$P(p; \phi_1, \phi_2) = \frac{\Gamma(\phi_1 + \phi_2)}{\Gamma(\phi_1)\Gamma(\phi_2)} p^{\phi_1 - 1}(1 - p)^{\phi_2 - 1},$$

$$P(q; \phi_3, \phi_4) = \frac{\Gamma(\phi_3 + \phi_4)}{\Gamma(\phi_3)\Gamma(\phi_4)} q^{\phi_3 - 1}(1 - q)^{\phi_4 - 1},$$

respectively, where $\Gamma(\cdot)$ is the gamma function, given by $\Gamma(y) = \int_0^\infty x^{y-1} e^{-x}\,dx$. However, for
$y \in \mathbb{Z}_+$ (as is the case in our work), the gamma function becomes $\Gamma(y) = (y - 1)!$.

By using the Beta distribution parameterized by posterior counts for $p$ and $q$, the posterior
update after observing state transitions is easy to compute. For example, suppose the posterior
count for the parameter $p$ is $\phi_1 = 5$ and $\phi_2 = 7$. After observing 2 state transitions from G to B
(with probability $p$) and 3 state transitions from G to G (with probability $1 - p$),
the posterior count for the parameter $p$ is simply $\phi_1 = 5 + 2 = 7$ and $\phi_2 = 7 + 3 = 10$.
Without loss of generality, we assume that $\phi$ is initially set to $[1, 1, 1, 1]$, which corresponds to
a uniform prior on $[0, 1]$ for both $p$ and $q$.
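The counting update in this example can be sketched as follows. The function names and the encoding of a transition as a pair of state labels are assumptions made for illustration; the posterior means use the standard mean of a Beta distribution.

```python
# Which entry of phi = [phi1, phi2, phi3, phi4] each observed transition increments.
COUNT_INDEX = {
    ("G", "B"): 0,  # probability p
    ("G", "G"): 1,  # probability 1 - p
    ("B", "G"): 2,  # probability q
    ("B", "B"): 3,  # probability 1 - q
}


def update_counts(phi, transitions):
    """Return the posterior counts after observing a list of (from, to) transitions."""
    new_phi = list(phi)
    for transition in transitions:
        new_phi[COUNT_INDEX[transition]] += 1
    return new_phi


def posterior_means(phi):
    """Posterior means of p and q under Beta(phi1, phi2) and Beta(phi3, phi4)."""
    return phi[0] / (phi[0] + phi[1]), phi[2] / (phi[2] + phi[3])


if __name__ == "__main__":
    # The example from the text: counts (5, 7) for p, then 2 G->B and 3 G->G transitions.
    observed = [("G", "B")] * 2 + [("G", "G")] * 3
    print(update_counts([5, 7, 1, 1], observed))  # counts for p become 7 and 10
```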

Note that we can infer the action history $a^t$ from the observation history $z^t$. More specifically,
for each time $t$, if $z_t = Z$, then $a_t = S$, and if $z_t \in \{G, B\}$, then $a_t = H$. In what follows, we
use only the observation history $z^t$ for the posterior update for the sake of simplicity. Consider the
joint posterior distribution $P(s_t, p, q \mid z^{t-1})$ of the energy state $s_t$ and the transition probabilities $p$
and $q$ at time $t$ from the observation history $z^{t-1}$. Let

$$S(z^{t-1}) = \{s^{t-1} : s_\tau = z_\tau\ \forall \tau \in \{\tau : z_\tau \ne Z\}\}$$

denote all possible state histories based on the observation history $z^{t-1}$. Let $C(\phi, S(z^{t-1}), s_t)$
denote the total number of state histories that lead to the posterior count $\phi$ from the initial
condition that all counts are equal to one; we call it the appearance count to distinguish it from
the posterior count $\phi$. Hence,

$$P(s_t, p, q \mid z^{t-1}) P(z^{t-1}) = P(z^{t-1}, s_t \mid p, q) P(p, q)$$
$$= \sum_{s^{t-1}} P(z^{t-1}, s^{t-1}, s_t \mid p, q) P(p, q)$$
$$= \sum_{s^{t-1} \in S(z^{t-1})} P(s^{t-1}, s_t \mid p, q) P(p, q)$$
$$= \sum_{\phi} C(\phi, S(z^{t-1}), s_t)\, p^{\phi_1 - 1}(1 - p)^{\phi_2 - 1} q^{\phi_3 - 1}(1 - q)^{\phi_4 - 1},$$

which can be written as

$$P(s_t, p, q \mid z^{t-1}) = \sum_{\phi} P(\phi, s_t \mid z^{t-1}) P(p, q \mid \phi),$$

where

$$P(\phi, s_t \mid z^{t-1}) = \frac{C(\phi, S(z^{t-1}), s_t)\,\Gamma(\phi_1)\Gamma(\phi_2)\Gamma(\phi_3)\Gamma(\phi_4)}{P(z^{t-1})\,\Gamma(\phi_1 + \phi_2)\Gamma(\phi_3 + \phi_4)}.$$

Therefore, the posterior $P(s_t, p, q \mid z^{t-1})$ can be seen as a probability distribution over the energy
state $s_t$ and the posterior count $\phi$. Furthermore, the posterior can be fully described by each
appearance count $C$ associated with the posterior count $\phi$ and the energy state $s_t$, up to the
normalization term $P(z^{t-1})$.
When we have a new observation $z_t$ at time $t$, the posterior at time $t + 1$ is updated in a
recursive form as follows:

$$P(s_{t+1}, p, q \mid z^t) = P(s_{t+1}, p, q \mid z^{t-1}, z_t)$$
$$= \sum_{s_t} P(s_t, p, q, s_{t+1} \mid z^{t-1}, z_t)$$
$$= \sum_{s_t} P(s_t, p, q, s_{t+1}, z_t \mid z^{t-1}) / P(z_t \mid z^{t-1})$$
$$= \sum_{s_t} P(s_t, p, q \mid z^{t-1}) P(s_{t+1}, z_t \mid s_t, p, q, z^{t-1}) / P(z_t \mid z^{t-1})$$
$$= \sum_{s_t} P(s_t, p, q \mid z^{t-1}) P(s_{t+1}, z_t \mid s_t, p, q) / P(z_t \mid z^{t-1}),$$

where $P(z_t \mid z^{t-1})$ is the normalization term.

If we harvest and observe the exact state, the total number of possible posterior counts will
remain the same. For example, if we harvest and observe that $z_t = G$, this implies that $s_t = G$.
The posterior for $s_{t+1} = B$ is then

$$P(B, p, q \mid z^t) P(z_t \mid z^{t-1}) = P(G, p, q \mid z^{t-1}) P(B \mid G, p, q)$$
$$= \sum_{\phi} P(\phi, G \mid z^{t-1}) P(p, q \mid \phi_1 + 1, \phi_2, \phi_3, \phi_4).$$

This update has the simple form that we take the posterior count $\phi$ associated with the G state at
the previous update, and increase the posterior count $\phi_1$ by one. On the other hand, the total
number of possible posterior counts will be at most multiplied by two for the sleeping action.
For example, if the action is to sleep, i.e., $z_t = Z$, then we have to iterate over two possible
states at time $t$ since we do not know the exact state. The posterior for $s_{t+1} = B$ is then

$$P(B, p, q \mid z^t) P(z_t \mid z^{t-1}) = \sum_{s_t \in \{G, B\}} P(s_t, p, q \mid z^{t-1}) P(B \mid s_t, p, q)$$
$$= \sum_{\phi} P(\phi, G \mid z^{t-1}) P(p, q \mid \phi_1 + 1, \phi_2, \phi_3, \phi_4) + \sum_{\phi} P(\phi, B \mid z^{t-1}) P(p, q \mid \phi_1, \phi_2, \phi_3, \phi_4 + 1).$$
The updates in other scenarios can be defined similarly. An example of the update of the
appearance count is shown in Figure 3. Note that two previously different posterior counts
could lead to the same value after one update, in which case we simply add their appearance counts.

Fig. 3. A belief-update example after two sleeping actions and one harvesting action with good state observation. The numbers
in the rectangle denote respectively the energy state (G or B), the posterior count $\phi$ and the appearance count $C$.
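The bookkeeping of appearance counts, including the merging of identical posterior counts, can be sketched with a dictionary keyed by the pair (energy state, posterior count). This is only an illustration of the counting: it tracks the integer appearance counts and omits the gamma-function weights and the normalization term. The data layout and function name are assumptions of this example.

```python
from collections import defaultdict

# Which entry of phi = (phi1, phi2, phi3, phi4) each transition increments.
COUNT_INDEX = {("G", "B"): 0, ("G", "G"): 1, ("B", "G"): 2, ("B", "B"): 3}


def update_appearance_counts(counts, observation):
    """One Bayesian bookkeeping step.

    counts: dict mapping (state, phi_tuple) -> appearance count.
    observation: "G" or "B" after harvesting, "Z" after sleeping.
    """
    new_counts = defaultdict(int)
    for (state, phi), appearance in counts.items():
        if observation != "Z" and state != observation:
            continue  # this state is ruled out by what the harvest revealed
        for next_state in ("G", "B"):
            new_phi = list(phi)
            new_phi[COUNT_INDEX[(state, next_state)]] += 1
            # Identical (state, phi) pairs are merged by adding their counts.
            new_counts[(next_state, tuple(new_phi))] += appearance
    return dict(new_counts)
```

Starting from `{("G", (1, 1, 1, 1)): 1, ("B", (1, 1, 1, 1)): 1}`, each sleeping step at most doubles the number of entries, while a harvesting step first discards the entries that contradict the observation.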


B. Extended POMDP formulation of the Bayesian framework 

The problem is then to derive an optimal policy in order to maximize the expected reward based 
on the current posterior distribution of the energy states and the state transition probabilities, 
obtained via the Bayesian framework described. This has been shown to be equivalent to deriving 


an optimal policy in an extended POMDP [20]. 

In what follows, we will show the detailed formulation of the POMDP. In the POMDP, the 
state space is {G,B} x Zi that denotes the energy state and the posterior count o of the 
Beta distribution. The action space and the reward function do not change. For brevity, we let 
I t — {st-i,f, at}. Recall that the state of this POMDP is By the formula of condi¬ 

tional probability and the independence assumptions, the joint state transition and observation 
probability is 


V(s t ,f',z t \I t ) = F(s t \I t )F(z t \I t , s t )F((j)'\I t , s t: z t ) 

= P(s t |s t _i, f)F{zt\s t )F{(j)'\st_i, <j), St), 

where $P(z_t \mid s_t) = 1$ if $z_t = s_t$, and $P(\phi' \mid s_{t-1}, \phi, s_t) = 1$ if the change of 
state from $s_{t-1}$ to $s_t$ leads to the corresponding update of $\phi$ to $\phi'$. Lastly, the 
transition probability $P(s_t \mid s_{t-1}, \phi)$ is derived from the average $p$ and $q$ associated 
with the posterior count $\phi$. For example, if $s_{t-1} = G$ and $s_t = B$, then 
$P(s_t \mid s_{t-1}, \phi) = \phi_1/(\phi_1 + \phi_2)$. Therefore, the problem of deriving the optimal 
policy in the Bayesian framework can be solved by techniques developed for POMDPs. The optimal 
policy tackles the exploration-exploitation dilemma by incorporating the uncertainty in the 
transition probabilities into the decision-making process. 
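To illustrate the factorization above, the following minimal Python sketch enumerates the 
successors $(s_t, \phi')$ of a hyper-state $(s_{t-1}, \phi)$ together with their probabilities. 
It assumes that $\phi_1, \phi_2, \phi_3, \phi_4$ count the transitions G to B, G to G, B to G and 
B to B, respectively, which is inferred from the example $P(B \mid G) = \phi_1/(\phi_1 + \phi_2)$; 
the function name `successors` is hypothetical and not part of the original formulation. 

```python
from fractions import Fraction


def successors(s_prev, phi):
    """Enumerate (s_next, phi_next, probability) for a hyper-state (s_prev, phi).

    Assumption of this sketch: phi = (phi1, phi2, phi3, phi4) counts the
    transitions G->B, G->G, B->G and B->B, respectively.
    """
    phi1, phi2, phi3, phi4 = phi
    if s_prev == "G":
        p = Fraction(phi1, phi1 + phi2)  # posterior mean of P(G -> B)
        return [
            ("B", (phi1 + 1, phi2, phi3, phi4), p),
            ("G", (phi1, phi2 + 1, phi3, phi4), 1 - p),
        ]
    q = Fraction(phi3, phi3 + phi4)      # posterior mean of P(B -> G)
    return [
        ("G", (phi1, phi2, phi3 + 1, phi4), q),
        ("B", (phi1, phi2, phi3, phi4 + 1), 1 - q),
    ]
```

Each successor carries exactly one incremented count, which mirrors the deterministic factor 
$P(\phi' \mid s_{t-1}, \phi, s_t)$, and the two probabilities returned for a given hyper-state 
sum to one. 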


C. Heuristic learning algorithm based on posterior sampling 

It is computationally difficult to solve the extended POMDP exactly due to its large state space. 
More precisely, during the Bayesian update, we keep the appearance count of every possible 
combination of posterior count $\phi$ and energy state (G or B). The challenge is that the number 
of possible posterior counts $\phi$ doubles after each sleeping action and can grow without bound. 
One approach would be to skip the posterior update during the sleeping action, so that the number 
of posterior counts stays constant at two. However, this approach is equivalent to heuristically 
assuming that the unknown energy state remains the same during the sleeping period. 

Instead, we propose the heuristic posterior sampling algorithm (Algorithm 2), inspired by [20], [25]. 
The idea is to keep the $K$ posterior counts that have the largest appearance counts in the Bayesian 
update. If the observed energy state is good, we keep harvesting. If it is bad, we draw a sample 
of the transition probabilities from the posterior distributions and find the optimal sleeping 
time corresponding to the sampled transition probabilities. This leverages the fact that the 
optimal policy with respect to a given set of transition probabilities is threshold-based and can 
be pre-computed off-line. 

More precisely, the algorithm maintains the value $\psi^G = [\phi_1, \phi_2, \phi_3, \phi_4, n]$, 
where $n$ is the appearance count of the posterior count $[\phi_1, \phi_2, \phi_3, \phi_4]$ together 
with the good state. The value $\psi^B$ is defined similarly. The two procedures in Line 22 and 
Line 24 compute the update of the posterior count and the appearance count with good and bad state 
observations, respectively. To reduce computational complexity, we draw a single posterior count 
with probability proportional to its appearance count, as shown in Line 10. The transition 
probability is taken to be the mean of the Beta distribution corresponding to the sampled posterior 
count, as shown in Line 11. Lastly, with the sleeping action, we have to invoke both the good-state 
and the bad-state updates in Lines 16 and 17, since the state is not observed. 
Algorithm 2: Posterior-sampling algorithm 

    Input: $r$, $\gamma$, $K$, optimal policy lookup table 
 1  Initialization: Let sleeping time $w = 0$ 
 2  while true do 
 3      if sleeping time $w = 0$ then 
 4          Harvest energy 
 5          if harvesting is successful (good state) then 
 6              Good State Update() 
 7              Sleeping time $w = 0$ 
 8          else 
 9              Bad State Update() 
10              Draw $\psi^G$ or $\psi^B$ with probability proportional to the count $n$ 
11              Let $p = \phi_1/(\phi_1 + \phi_2)$, $q = \phi_3/(\phi_3 + \phi_4)$ 
12              Find sleeping time $w$ from the lookup table 
13          end 
14      else 
15          Sleep and decrease sleeping time $w = w - 1$ 
16          Good State Update() 
17          Bad State Update() 
18      end 
19      Merge $\psi^G$ and $\psi^B$ with the same posterior count by summing the appearance count $n$ 
20      Assign the $2K$ items of $\psi^G$ and $\psi^B$ with the highest $n$ to $\psi^G$ and $\psi^B$, 
        respectively 
21  end 
22  Procedure Good State Update() 
23      For each $\psi^G$, generate a new $\psi^G$ such that $\psi^G(\phi_2) = \psi^G(\phi_2) + 1$ and 
        a new $\psi^B$ such that $\psi^B(\phi_1) = \psi^G(\phi_1) + 1$ 
24  Procedure Bad State Update() 
25      For each $\psi^B$, generate a new $\psi^G$ such that $\psi^G(\phi_3) = \psi^B(\phi_3) + 1$ and 
        a new $\psi^B$ such that $\psi^B(\phi_4) = \psi^B(\phi_4) + 1$ 
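The bookkeeping steps of the algorithm (expansion during sleep, merging, truncation, and sampling) 
can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not 
the authors' implementation: the belief is stored as a `Counter` keyed by (state, posterior count), 
each successor inherits the appearance count of its parent, and truncation keeps $K$ entries per 
energy state. The function names are hypothetical. 

```python
import random
from collections import Counter


def sleep_update(belief):
    """Expand every (state, phi) entry into both successor states.

    This corresponds to invoking both update procedures when the state is
    not observed. Assumption: successors inherit the parent's count n.
    """
    new = Counter()
    for (s, (f1, f2, f3, f4)), n in belief.items():
        if s == "G":  # leaving the good state: f1 counts G->B, f2 counts G->G
            new[("B", (f1 + 1, f2, f3, f4))] += n
            new[("G", (f1, f2 + 1, f3, f4))] += n
        else:         # leaving the bad state: f3 counts B->G, f4 counts B->B
            new[("G", (f1, f2, f3 + 1, f4))] += n
            new[("B", (f1, f2, f3, f4 + 1))] += n
    return new        # identical entries were merged by summing their counts


def truncate(belief, K):
    """Keep, for each energy state, the K entries with the largest count."""
    kept = Counter()
    for state in ("G", "B"):
        entries = [(key, n) for key, n in belief.items() if key[0] == state]
        entries.sort(key=lambda item: item[1], reverse=True)
        kept.update(dict(entries[:K]))
    return kept


def sample_pq(belief, rng):
    """Draw an entry proportionally to its count and return Beta means (p, q)."""
    keys = list(belief)
    weights = [belief[k] for k in keys]
    _, (f1, f2, f3, f4) = rng.choices(keys, weights=weights, k=1)[0]
    return f1 / (f1 + f2), f3 / (f3 + f4)
```

Starting from a single entry, three consecutive calls to `sleep_update` produce seven distinct 
entries rather than eight, because two different paths reach the same posterior count and are 
merged; this is the merging effect that keeps the growth below pure doubling. 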


V. Numerical Examples 
A. Known transition probabilities 

In the case of known transition probabilities of the Markov chain model, the optimal energy 
harvesting policy can be fully characterized by the sleeping time after an unsuccessful harvesting 
attempt (cf. Proposition 1). For different values of reward and cost, we show in Figures 4 to 6 the 
optimal sleeping time, indexed by the average number of time slots the model stays in the bad 
harvesting state, $T_B = 1/q$, and the probability of being in the good state, $\Pi_G = q/(p + q)$. 
Note that the bottom-left region without any color corresponds to the case $1 - p > q$. The region 
in black denotes the scenario in which it is not optimal to harvest any more after an 
unsuccessful harvesting attempt. 

Fig. 4. Optimal sleeping time with $r_1 = 10$, $r_0 = 1$ and $\gamma = 0.99$. 

Fig. 5. Optimal sleeping time with $r_1 = 10$, $r_0 = 10$ and $\gamma = 0.99$. 

From these figures, we first observe that the optimal sleeping time increases, as expected, with 
longer burst lengths and smaller success probabilities. Moreover, the optimal sleeping 
time depends not only on the burst length and the success probability, but also on 
the ratio between the reward $r_1$ and the penalty $r_0$. One might be misled into believing that 
if the reward is much larger than the cost, then the optimal policy should harvest all the time. 
However, Figure 4 shows that for a rather large part of the parameter space, the optimal policy is 
to sleep for one or two time slots after an unsuccessful harvesting attempt. On the other hand, 
when the cost is larger (i.e., larger $r_0$), it is better not to harvest at all in a larger part 
of the parameter space. Nevertheless, there still exists a non-trivial selection of the sleeping 
time that maximizes the total harvested energy, as shown in Figure 6. Figure 7 shows that the 
accumulated energy can be significant. 

Fig. 6. Optimal sleeping time with $r_1 = 1$, $r_0 = 10$ and $\gamma = 0.99$. 



Fig. 7. Maximum harvested energy with $r_1 = 1$, $r_0 = 10$ and $\gamma = 0.99$. 


In these numerical examples, we let the reward $r_1$ and the penalty $r_0$ be close, with their 
ratio between 0.1 and 10. We believe such choices are practical. For example, for the AT86RF231 [26] 
(a low-power radio transceiver), it can be computed that sensing the channel takes about 3 μJ of 
energy, since one clear channel assessment takes 140 μs and the power drawn while keeping the radio 
on is 22 mW. Moreover, the energy harvesting rate of current technology is around 200 μW [27]. 
Suppose the coherence time of the energy source is $T$ milliseconds, which corresponds to 
the duration of the time-slot. The ratio $r_1/r_0$ is then roughly $(0.2T - 3)/3$, which ranges 
from about 0.3 to about 12 for $T \in [20, 200]$ milliseconds. Therefore, the ratio between the 
reward $r_1$ and the penalty $r_0$ is neither too large nor too small, and the POMDP formulation 
and the threshold-based optimal policy are useful in practice for deriving the non-trivial optimal 
sleeping time. 
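The arithmetic behind this ratio can be checked with a few lines of Python. The sketch below uses 
only the figures quoted above (140 μs at 22 mW for one channel assessment, a 200 μW harvesting 
rate); the function name and the parameter names are hypothetical, and the model (net gain of a 
successful slot divided by the sensing cost) is the simplified one used in the text. 

```python
def reward_ratio(T_ms, harvest_uW=200.0, radio_mW=22.0, cca_us=140.0):
    """Return r1/r0 for a time-slot of T_ms milliseconds.

    r0 is the energy spent on one clear channel assessment and r1 is the
    energy harvested in one slot minus that cost, both in microjoules.
    """
    cost_uJ = radio_mW * cca_us / 1000.0        # mW * us = nJ, so /1000 gives uJ
    harvested_uJ = harvest_uW * T_ms / 1000.0   # uW * ms = nJ, so /1000 gives uJ
    return (harvested_uJ - cost_uJ) / cost_uJ


for T in (20, 50, 100, 200):
    print(T, round(reward_ratio(T), 2))
```

With the unrounded sensing cost of 3.08 μJ, the ratio is slightly below 0.3 at $T = 20$ and slightly 
below 12 at $T = 200$, in line with the rounded expression $(0.2T - 3)/3$. 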

Recall that the threshold-based optimal policy in Proposition 1 induces a discrete-time Markov 
chain with state $(S, \mathcal{E})$, which denotes the energy arrival state at the previous 
time-slot and the energy level at the current time-slot, respectively. Note that, once the battery 
is completely depleted, we cannot turn on the radio to harvest anymore, which corresponds to the 
absorbing states $(S, 0)$ for any $S$ in this Markov chain. Suppose the maximum energy level is 
$\bar{\mathcal{E}}$, which introduces the other absorbing states $(S, \bar{\mathcal{E}})$ for any 
$S$. Without loss of generality, we assume that the energy level in the battery is a multiple of 
the energy harvested in each time-slot and of the cost of an unsuccessful harvesting attempt. 
Hence, this Markov chain has a finite number of states, and we can derive some interesting 
parameters using standard tools from absorbing Markov chain theory [28]. 

Figure 8 shows the full-charge probability for a hypothetical energy harvesting device with an 
average probability of successful energy arrival equal to 0.7, under different initial energy 
levels. We assume that the maximum battery level is 100 units, and that one successful harvesting 
attempt accumulates one unit of energy while one unsuccessful attempt costs one unit of energy. 
The plots can guide us in designing appropriate packet transmission policies. For example, for a 
burst length equal to 10, we should refrain from transmitting a packet once the battery is around 
20% full if we want to keep the depletion probability smaller than $5 \cdot 10^{-4}$. 
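As a rough illustration of how such a full-charge probability can be obtained, the sketch below 
solves the absorption equations of a simplified chain by fixed-point iteration. It is not the 
authors' computation and rests on several assumptions: a fixed sleeping time `w` after every 
unsuccessful attempt stands in for the threshold policy of Proposition 1, sleeping costs no energy, 
each attempt changes the battery by exactly one unit, and $S$ is the state revealed by the most 
recent harvesting attempt. The function name is hypothetical. 

```python
def full_charge_probability(p, q, w, e_max, tol=1e-12, max_iter=1_000_000):
    """Return h[S][E], the probability of reaching e_max before level 0.

    p = P(G -> B) and q = P(B -> G) per time-slot. After a good slot the node
    harvests again in the next slot; after a bad slot it first sleeps w slots.
    """
    lam = 1.0 - p - q                    # second eigenvalue of the two-state chain
    pi_g = q / (p + q)                   # stationary probability of the good state
    succ = {
        "G": 1.0 - p,                          # P(G -> G) in one step
        "B": pi_g * (1.0 - lam ** (w + 1)),    # P(B -> G) in w + 1 steps
    }
    # Boundary conditions: h[s][0] = 0 (depleted), h[s][e_max] = 1 (full).
    h = {s: [0.0] * e_max + [1.0] for s in "GB"}
    for _ in range(max_iter):
        delta = 0.0
        for s in "GB":
            for e in range(1, e_max):
                new = succ[s] * h["G"][e + 1] + (1.0 - succ[s]) * h["B"][e - 1]
                delta = max(delta, abs(new - h[s][e]))
                h[s][e] = new
        if delta < tol:
            break
    return h
```

In the special case $p + q = 1$ and `w = 0`, the energy arrivals are independent across slots and 
the result coincides with the classical gambler's-ruin formula, which gives a simple sanity check. 
For the setting described above, one would choose $p$ and $q$ such that $q/(p + q) = 0.7$ and 
$1/q$ equals the desired burst length, and set `e_max = 100`. 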

Lastly, Figure 9 shows the average number of time-slots needed to reach full charge, given that 
the device manages to fully charge the battery, under different initial energy levels and average 
burst lengths. The figure shows a decreasing and almost linear relation between the initial energy 
level and the average number of time-slots when the initial energy level becomes larger. Similarly, 
the slope of these curves can help us determine whether we can expect to support a sensor 
application with a specified data transmission rate. Suppose the cost of one packet transmission 
is 40 units. If the data rate is larger than one packet per 50 time slots, the energy harvesting 
device would quickly deplete the battery, since it takes more than 50 time slots to harvest 40 
units of energy. On the other hand, if the data rate is smaller than one packet per 100 time slots, 
then we are confident that the device can support such applications. 

Fig. 8. The full-charge probability under different initial energy levels and average burst lengths. 
Fig. 9. The expected number of time-slots to reach full-charge under different initial energy levels and average burst lengths. 


B. Unknown transition probabilities 

In this section, we demonstrate the performance of the Bayesian learning algorithm. Figure 10 
shows that Algorithm 2 outperforms other heuristic learning algorithms in 
terms of the total discounted reward. The results are averaged over three hundred independent 
energy arrival sample paths generated from the unknown Markov chain, and for each sample 
path the rewards are averaged over one hundred independent runs. In the heuristic posterior 
sampling method, the posterior count is only updated when we have an observation of the state 
transition (i.e., two consecutive harvesting actions that both reveal the state of the Markov chain). 
In the heuristic random sampling method, we replace Line 10 and Line 11 in Algorithm 2 with 
a uniformly selected set of parameters $p$ and $q$. Because of the heuristic choice of keeping only 
$K$ posterior counts, the Bayesian update in Algorithm 2 is not exact and the parameter estimation 
is biased. However, its total reward still exceeds that of the other methods as a result of its 
smarter exploration decisions during the learning phase. Note also that, because the discount 
factor $\gamma$ is strictly smaller than one, the reward and the penalty after five hundred 
time-slots are negligible compared to the already accumulated rewards. 
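The size of this discounted tail can be bounded with a short calculation. Assuming the magnitude of 
the per-slot reward is bounded by some constant $r_{\max}$ (a symbol introduced here only for 
illustration), and writing $r_t$ for the reward in slot $t$, 

$$
\Bigl|\sum_{t \ge 500} \gamma^{t} r_t\Bigr| \le r_{\max}\,\frac{\gamma^{500}}{1-\gamma},
\qquad
\frac{\gamma^{500}/(1-\gamma)}{1/(1-\gamma)} = \gamma^{500} = 0.99^{500} \approx 6.6 \times 10^{-3}.
$$

In other words, with $\gamma = 0.99$ the slots after the first five hundred can contribute less than 
one percent of the largest possible total discounted reward. 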
Fig. 10. Total rewards with different algorithms with $\Pi_G = 0.6$, $T_B = 2.5$, $r_0 = 10$, $r_1 = 10$, $\gamma = 0.99$, $K = 20$. 


VI. Conclusions and Future Work 

A. Conclusions 

In this paper, we studied the problem of when a wireless node with RF-EH capabilities should 
try to harvest ambient RF energy and when it should sleep instead. We assumed that the overall 
energy level is constant during one time-slot, and may change in the next time-slot according to 
a two-state Gilbert-Elliott Markov chain model. Based on this model, we considered two cases. 
First, we assumed knowledge of the transition probabilities of the Markov chain; on these grounds, 
we formulated the problem as a Partially Observable Markov Decision Process (POMDP) and 
determined a threshold-based optimal policy. Second, we assumed that we do not have any 
knowledge about these parameters and formulated the problem as a Bayesian adaptive POMDP. 
To simplify computations, we also proposed a heuristic posterior sampling algorithm. Numerical 
examples have shown the benefits of our approach. 

B. Future Work 

Since energy harvesting may result in different energy intakes, part of our future work is to 
extend the Markov chain model to account for as many states as the levels of the harvested 
energy and in addition to include another Markov chain that models the state of the battery. 

The problem of harvesting from multiple channels is of interest when considering multi-antenna 
devices. The formulation of this problem falls into the restless bandit problem framework 
and is left for future work. 

Finally, part of our ongoing research focuses on investigating what can be done when the 
parameters of the Markov chain model change over time. 